Reading Excel Data Via the Knowledge Network

Python provides plenty of tools for reading and visualising different kinds of tabular data. Here we demonstrate using pandas and xlrd to load and manipulate excel data and the display this data in a matplotlib graph. Using a simple custom widget defined in the 'Filtering Widget.ipynb'.

This notebook was developed by the Oznome Project and is available at https://github.com/oznome/jupyter-examples. The filtering widget was produced in conjunction with the Oznome project as part of the CSIRO EUDM project.


In [155]:
import pandas, xlrd, requests, json
from pandas import np
import matplotlib.pyplot as plt
pandas.set_option('display.max_columns', 500)
%matplotlib inline

We request a data record from the CSIRO Knowledge Network


In [82]:
response = requests.get("http://kn.csiro.au/api/dataset?id=http%3A%2F%2Foznome.csiro.au%2Fid%2Fdata%2Fdata-gov-au%2Fsaimos-biological-and-flow-cytometry-data-collected-from-ctd-stations-in-south-australia-i-20142")
json_data = response.json()

Exploring the json respone data we can then figure out how to query this record for futher details. First though lets create a class (from https://stackoverflow.com/questions/18873066/pretty-json-formatting-in-ipython-notebook) to make our json easier to explore.


In [159]:
import uuid
from IPython.display import display_javascript, display_html, display
import json

class RenderJSON(object):
    def __init__(self, json_data):
        if isinstance(json_data, dict):
            self.json_str = json.dumps(json_data)
        else:
            self.json_str = json_data
        self.uuid = str(uuid.uuid4())

    def _ipython_display_(self):
        display_html('<div id="{}" style="height: 600px; width:100%;"></div>'.format(self.uuid), raw=True)
        display_javascript("""
        require(["https://rawgit.com/caldwell/renderjson/master/renderjson.js"], function() {
        document.getElementById('%s').appendChild(renderjson(%s))
        });
        """ % (self.uuid, self.json_str), raw=True)

In [160]:
json_data = response.json()
RenderJSON(json_data)


By using a python list comprehension we can query a particular excel workbook that is referenced in the metadata record


In [84]:
url = [resource for resource in json_data["resources"] if "Picophytoplankton" in resource["name"]][0]["url"]
url


Out[84]:
'http://data.aodn.org.au/IMOS/SAIMOS/Flow_Cytometry/2014/07/2014_07_SAIMOS_ANMN_Picophyto_FV02.xls'

Then using requests we can query the specific excel workbook and open it using xlrd


In [85]:
r = requests.get(url)
book = xlrd.open_workbook(file_contents=r.content)

Then list the various sheets contained in this workbook


In [86]:
book.sheet_names()


Out[86]:
['RAW',
 'Values for Conversions',
 'Converted_Normalised_QAQC',
 'Converted_CLEAN',
 'emII format']

Using pandas we can read one of these sheets and do some exporatory data analysis starting with listing the available columns


In [87]:
dataframe = pandas.read_excel(url,sheetname='Converted_CLEAN')
dataframe.columns


Out[87]:
Index(['Flag', 'Flag comments', 'Station', 'Rep', 'Depth (category)',
       'Depth (m)', 'Prochlorococus', 'Prochloroccus Norm FSC',
       'Prochloroccus NormSSC', 'Prochloroccus  Norm Orange',
       'Prochloroccus  Red', 'Prochlococcus Norm FarRed', 'Synechococcus ',
       'Synechococcus Norm FSC', 'Synechococcus Norm SSC',
       'Synechococcus Norm Orange', 'Synechococcus Norm Chla',
       'Synechococcus Norm FarRed', 'Picoeukaryotes',
       'Picoeukaryotes Norm FSC', 'PicoeukaryotesNorm SSC',
       'Picoeukaryotes Norm Orange', 'Picoeukaryotes Norm Chla',
       'Picoeukaryotes Norm FarRed'],
      dtype='object')

The dataframe describe function tells us more information about these columns


In [88]:
dataframe.describe(include='all')


Out[88]:
Flag Flag comments Station Rep Depth (category) Depth (m) Prochlorococus Prochloroccus Norm FSC Prochloroccus NormSSC Prochloroccus Norm Orange Prochloroccus Red Prochlococcus Norm FarRed Synechococcus Synechococcus Norm FSC Synechococcus Norm SSC Synechococcus Norm Orange Synechococcus Norm Chla Synechococcus Norm FarRed Picoeukaryotes Picoeukaryotes Norm FSC PicoeukaryotesNorm SSC Picoeukaryotes Norm Orange Picoeukaryotes Norm Chla Picoeukaryotes Norm FarRed
count 27.000000 1 27 27.00000 27 27.000000 27.000000 27.000000 27.000000 27.00000 27.000000 27.00000 27.000000 27.000000 27.000000 27.000000 27.00000 27.000000 27.000000 27.000000 27.000000 27.000000 27.000000 27.000000
unique NaN 1 3 NaN 3 NaN 26.000000 27.000000 25.000000 24.00000 10.000000 22.00000 27.000000 22.000000 18.000000 26.000000 24.00000 26.000000 27.000000 26.000000 26.000000 24.000000 27.000000 25.000000
top NaN missing sample SAM5CB NaN Bottom NaN 3481.481481 0.147575 0.093142 0.06694 0.034662 0.90681 14330.864198 0.216195 0.056802 1.594755 6.68876 9.272416 10686.419753 1.570825 0.253775 0.027233 33.817241 28.880792
freq NaN 1 10 NaN 9 NaN 2.000000 1.000000 2.000000 2.00000 7.000000 2.00000 1.000000 2.000000 3.000000 2.000000 2.00000 2.000000 1.000000 2.000000 2.000000 3.000000 1.000000 3.000000
mean 1.296296 NaN NaN 2.00000 NaN 38.777778 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
std 1.539601 NaN NaN 0.83205 NaN 27.237324 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
min 1.000000 NaN NaN 1.00000 NaN 5.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25% 1.000000 NaN NaN 1.00000 NaN 5.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
50% 1.000000 NaN NaN 2.00000 NaN 42.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
75% 1.000000 NaN NaN 3.00000 NaN 65.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
max 9.000000 NaN NaN 3.00000 NaN 75.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Sometime it is useful to build widgets to help explore data and make a notebook user friendly you can define widgets inline a notebook but it is also easy to define widgets in an adjacent notebook and then run them using %run


In [89]:
%run Filtering\ Widget.ipynb

The above code creates a widget class called FilteringWidget which we can then use to create a widget


In [90]:
filtered = FilteringWidget(dataframe, ['Station', 'Depth (m)'])


Interacting with this widget produces a filtered dataframe available by accessing the dataframe property. For example we can restrict the dataframe to the NRSKAI station


In [91]:
filtered.dataframe


Out[91]:
Flag Flag comments Station Rep Depth (category) Depth (m) Prochlorococus Prochloroccus Norm FSC Prochloroccus NormSSC Prochloroccus Norm Orange Prochloroccus Red Prochlococcus Norm FarRed Synechococcus Synechococcus Norm FSC Synechococcus Norm SSC Synechococcus Norm Orange Synechococcus Norm Chla Synechococcus Norm FarRed Picoeukaryotes Picoeukaryotes Norm FSC PicoeukaryotesNorm SSC Picoeukaryotes Norm Orange Picoeukaryotes Norm Chla Picoeukaryotes Norm FarRed
NRSKAI_D1R1.fcs 1 NaN NRSKAI 1 Surface 5 6083.95 0.172377 0.0403108 0.0823658 0.0331791 0.90681 14528.4 0.2658 0.0577178 2.85348 11.2611 14.5985 6923.46 2.72579 0.344169 0.026757 46.4211 41.3718
NRSKAI_D1R2.fcs 1 NaN NRSKAI 2 Surface 5 7343.21 0.231077 0.0461131 0.0716059 0.0346615 0.834042 14330.9 0.298457 0.0638255 1.64408 6.91678 9.27242 6824.69 2.75101 0.350582 0.0272331 44.8913 42.1359
NRSKAI_D1R3.fcs 1 NaN NRSKAI 3 Surface 5 6908.64 0.245131 0.0540531 0.0669401 0.0331791 0.831243 13940.7 0.296803 0.0638255 1.67445 7.1095 9.58868 7259.26 2.91471 0.37715 0.028471 47.5661 43.7032
NRSKAI_D2R1.fcs 1 NaN NRSKAI 1 DCM 65 4869.14 0.147575 0.0400054 0.0728438 0.0324731 0.83684 10775.3 0.19594 0.0458077 1.77967 7.1095 9.16046 6518.52 1.69484 0.249805 0.0184728 35.9414 30.1403
NRSKAI_D2R2.fcs 1 NaN NRSKAI 2 DCM 65 5520.99 0.190152 0.0470293 0.0756052 0.0310613 0.892816 10938.3 0.249678 0.0601608 1.79053 7.50977 10.2828 6074.07 1.84035 0.309355 0.0212342 38.667 34.358
NRSKAI_D3R1.fcs 1 NaN NRSKAI 3 DCM 65 5659.26 0.361703 0.114519 0.063893 0.0346615 0.789261 6943.21 0.216195 0.057107 1.59476 7.13068 9.67544 3846.91 1.62415 0.27729 0.0272331 33.9203 30.2298
NRSKAI_D3R1.fcs 9 missing sample NRSKAI 1 Bottom 75 - - - - - - - - - - - - - - - - - -
NRSKAI_D3R2.fcs 1 NaN NRSKAI 2 Bottom 75 5624.69 0.396427 0.116352 0.0610364 0.0352969 0.817249 7135.8 0.217435 0.0568016 1.41612 6.64852 9.13248 3679.01 1.58033 0.270571 0.0272331 33.8172 30.5097
NRSKAI_D3R3.fcs 1 NaN NRSKAI 3 Bottom 75 5540.74 0.325326 0.114519 0.0658927 0.0346615 0.870426 7274.07 0.216195 0.0577178 1.59476 7.32905 10.1904 3446.91 1.57082 0.279733 0.0213294 33.3061 29.684

In [174]:
filtered_frame = filtered.dataframe.replace('-', np.nan)

Thinking about what this data might be we can make some experimental assumptions. Perhaps Rep is experiment repeat number. We can further restrict the data frame to particular organism by specifying a columns


In [175]:
filtered_frame = filtered_frame[[ 'Synechococcus ','Prochlorococus', 'Picoeukaryotes', 'Rep', 'Depth (category)']]

It looks like there is data for all the organisms in "Rep 2" so lets filter to that


In [176]:
filtered_frame = filtered_frame.loc[filtered_frame['Rep'] == 2]

In [177]:
filtered_frame


Out[177]:
Synechococcus Prochlorococus Picoeukaryotes Rep Depth (category)
NRSKAI_D1R2.fcs 14330.864198 7343.209877 6824.691358 2 Surface
NRSKAI_D2R2.fcs 10938.271605 5520.987654 6074.074074 2 DCM
NRSKAI_D3R2.fcs 7135.802469 5624.691358 3679.012346 2 Bottom

The current index looks like a unique identifier for the experiment but we have restricted the data so that depth also provides a unique identifier in our filtered dataset lets get rid of the old index and set the new index to the depth so we can create a visualisation of this data


In [178]:
filtered_frame = filtered_frame.set_index('Depth (category)')

we also won't need the rep column


In [179]:
filtered_frame.pop('Rep')
pass

In [180]:
filtered_frame.plot(kind='bar', stacked=True)


Out[180]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f67fc585668>